{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/CoreTheGreat/HBPU-Machine-Learning-Course/blob/main/ML_Chapter4_Clustering.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "lPboLx_o0UxI"
   },
   "source": [
    "# 第四章：聚类\n",
    "湖北理工学院《机器学习》课程资料\n",
    "\n",
    "作者：李辉楚吴\n",
    "\n",
    "笔记内容概述: 电影评分分析中的聚类问题及应用\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WYv0kOOZWEaT"
   },
   "source": [
    "Moves Data File Structure (movies.csv)\n",
    "\n",
    "Genres: (no genres listed), 动作（Action）, 冒险（Adventure）, 动画（Animation）, 儿童（Children）, 喜剧（Comedy）, 犯罪（Crime）, 纪录片（Documentary）, 剧情（Drama）, 奇幻（Fantasy）, 黑色电影（Film-Noir）, 恐怖（Horror）, IMAX, 音乐（Musical）, 推理（Mystery）, 爱情（Romance）, 科幻（Sci-Fi）, 惊悚（Thriller）, 战争（War）, 西部（Western）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "1JbjUpmKUhuv",
    "outputId": "0fb8b51a-97ed-4855-cb89-71528b2d08ed"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "movies = pd.read_csv('./Data/movies.csv')\n",
    "print(f'Movie number: {len(movies)}')\n",
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Statistic genres\n",
    "genres_list = []\n",
    "for genres in movies['genres']:\n",
    "    genres_list.extend(genres.split('|'))\n",
    "genres_list = np.sort(np.unique(genres_list))\n",
    "print(f'Movie genres: {genres_list}')\n",
    "\n",
    "label_size = 18 # Label size\n",
    "ticklabel_size = 14 # Tick label size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "XLSroDN4Y7CY"
   },
   "source": [
    "Rating Data File Structure (ratings.csv)\n",
    "\n",
    "Each line of this file after the header row represents one rating of one movie by one user, and has the following format:\n",
    "\n",
    "userId, movieId, rating, timestamp\n",
    "\n",
    "Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).\n",
    "\n",
    "Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 224
    },
    "id": "05vhvNwyUujv",
    "outputId": "3c3085a2-7acc-4f0c-df1c-7b343ff71aae"
   },
   "outputs": [],
   "source": [
    "ratings = pd.read_csv('./Data/ratings.csv')\n",
    "user_list = np.sort(np.unique(ratings['userId']))\n",
    "print(f'{len(user_list)} have provided {len(ratings)} rate records')\n",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "fdYsKkr4gD1d"
   },
   "source": [
    "### 用DBSCAN进行聚类，分析主要用户群体\n",
    "\n",
    "统计用户对各种电影类型的喜爱程度，即计算用户对各类电影的平均分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "kK7lw-OTEqM3",
    "outputId": "508f6d21-f647-4ab0-a497-12cd075d7330"
   },
   "outputs": [],
   "source": [
    "# Create an array to save rates of all genres\n",
    "user_genres_rate = np.zeros((len(user_list), len(genres_list)))\n",
    "\n",
    "# Create an array to save movie rating counts of all genres\n",
    "# To compute average rates\n",
    "user_genres_rate_counts = np.zeros((len(user_list), len(genres_list)))\n",
    "\n",
    "for i in range(len(ratings)):\n",
    "    # User ID start from 1\n",
    "    # Convert user ID to user_idx by minus 1\n",
    "    user_idx = int(ratings.iloc[i]['userId'] - 1)\n",
    "    \n",
    "    # Get rate\n",
    "    rate = ratings.iloc[i]['rating']\n",
    "\n",
    "    # Get target movie\n",
    "    movie_id = ratings.iloc[i]['movieId']\n",
    "\n",
    "    # Split movie genres string\n",
    "    genres_type = movies[movies['movieId'] == movie_id]['genres'].values[0]\n",
    "    genres_type = genres_type.split('|')\n",
    "\n",
    "    # Statistic rates of movie genres\n",
    "    for genres in genres_type:\n",
    "        # Using '[0][0]' to get index from a list tuple\n",
    "        # First [0] get the index list outputed by np.where()\n",
    "        # Second [0] get the first item of list (with single item)\n",
    "        genres_idx = np.where(genres_list == genres)[0][0]\n",
    "\n",
    "        # Sum rate\n",
    "        user_genres_rate[user_idx, genres_idx] += rate\n",
    "        \n",
    "        # Count movie number of the genres\n",
    "        user_genres_rate_counts[user_idx, genres_idx] += 1\n",
    "\n",
    "# Compute genres rates of all users\n",
    "user_genres_rate = user_genres_rate / user_genres_rate_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用散点图展示动作电影（Action）和动画电影（Animation）的用户喜爱情况\n",
    "\n",
    "可选电影类型: \n",
    "\n",
    "'(no genres listed)' 'Action' 'Adventure' 'Animation' 'Children' 'Comedy'\n",
    "\n",
    "'Crime' 'Documentary' 'Drama' 'Fantasy' 'Film-Noir' 'Horror' 'IMAX'\n",
    "\n",
    "'Musical' 'Mystery' 'Romance' 'Sci-Fi' 'Thriller' 'War' 'Western'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 1000
    },
    "id": "KaY4gClVJB6v",
    "outputId": "8527cef2-c3bb-4ca0-e584-9e328a88d557"
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "genres_1 = 'Action' # Define first genres\n",
    "genres_2 = 'Animation' # Define second genres\n",
    "\n",
    "idx_g1 = np.where(genres_list == genres_1)[0][0] # Get genres 1 index in the genres_list\n",
    "idx_g2 = np.where(genres_list == genres_2)[0][0] # Get genres 2 index in the genres_list\n",
    "\n",
    "# Get list of user who have watched movies including both genres_1 and genres_2\n",
    "# Idea: People who have watched related movies are more credible in determining whether they like them or not.\n",
    "idx_both = np.where(np.logical_and(user_genres_rate[:,idx_g1] > 0, user_genres_rate[:,idx_g2] > 0))[0]\n",
    "print(f'Number of users who watched both {genres_1} and {genres_2}: {len(idx_both)}')\n",
    "\n",
    "# Filter rating information of users who have watched both genres_1 and genres_2\n",
    "x = user_genres_rate[idx_both][:, [idx_g1, idx_g2]]\n",
    "print(f'Shape of x: {x.shape}')\n",
    "\n",
    "# Drawing rates distribution by scatter figure\n",
    "fig, ax = plt.subplots(figsize=(6,6))\n",
    "\n",
    "ax.scatter(x[:,0], x[:,1], marker=\"o\", c='white', s=10**2, edgecolor=\"k\")\n",
    "\n",
    "ax.tick_params(axis='both', which='major', labelsize=ticklabel_size) # Set tick label size\n",
    "ax.set_xlabel(f'Average rate of {genres_1}', fontsize=label_size)\n",
    "ax.set_ylabel(f'Average rate of {genres_2}', fontsize=label_size)\n",
    "ax.set_xlim([-0.1, 5.1])\n",
    "ax.set_ylim([-0.1, 5.1])\n",
    "ax.tick_params(axis='both', which='major', labelsize=ticklabel_size) # Set tick label size\n",
    "# plt.savefig(f'rate_of_interested_movies.png', dpi=300) # Make figure clearer\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用曼哈顿距离计算各点之间的距离矩阵\n",
    "\n",
    "* 可用循环的方式实现，但是运行速度比较慢"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert x to 3-D by adding a new axis\n",
    "x_exp = x[:, np.newaxis]\n",
    "print('The shape of x_exp is', x_exp.shape)\n",
    "\n",
    "# Compute diff vector matrix by numpy's broadcast machanism\n",
    "x_diff = x[:, np.newaxis] - x\n",
    "print('The shape of x_diff', x_diff.shape)\n",
    "\n",
    "# Sum difference to compute Manhattan distance\n",
    "distance_map = np.abs(x_diff).sum(axis=2)\n",
    "print('The shape of distance_map', distance_map.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "绘制min-pts为3时的k-Distance图，判断合适的eps所处区间"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 551
    },
    "id": "TGSPq0ZlwU_V",
    "outputId": "851d99ed-fc4e-4768-e9ad-78e6c0b624a3"
   },
   "outputs": [],
   "source": [
    "# Using k-Distance to find potential values of e\n",
    "min_pts = 3\n",
    "\n",
    "# k-Distance\n",
    "k_distance = np.sort(np.sort(distance_map, axis=1)[:, min_pts])[::-1]\n",
    "\n",
    "# Draw k-Distance figure\n",
    "fig, ax = plt.subplots(figsize=(10,6))\n",
    "ax.plot(np.arange(1, len(k_distance)+1), k_distance, marker='o', linestyle='-', color='tab:blue')\n",
    "ax.set_xlabel('Sample Index', fontsize=label_size)\n",
    "ax.set_ylabel('k-Distance', fontsize=label_size)\n",
    "ax.tick_params(axis='both', which='major', labelsize=ticklabel_size) # Set tick label size\n",
    "# plt.savefig(f'rate_dbscan_kdistance.png', dpi=300)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QNqkoRaCy80z"
   },
   "source": [
    "令eps为0.25进行密度聚类，获取主要受众群体"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 902
    },
    "id": "AOOxlGcUjNkd",
    "outputId": "24bd1254-b437-48cc-9cae-120ae3f3628b"
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import DBSCAN\n",
    "\n",
    "eps = 0.25\n",
    "\n",
    "# Define DBSCAN clustering model\n",
    "mdl_dbscan = DBSCAN(eps=eps, min_samples=min_pts)\n",
    "\n",
    "# Train k-Means model\n",
    "mdl_dbscan.fit(x)\n",
    "\n",
    "labels_dbscan = mdl_dbscan.labels_\n",
    "\n",
    "# Get the biggest cluster from labels_dbscan\n",
    "# Filter out noise points (labels = -1) before using np.bincount()\n",
    "biggest_cluster_label = np.argmax(np.bincount(labels_dbscan[labels_dbscan != -1]))\n",
    "print(f'Biggest cluster label: {biggest_cluster_label}')\n",
    "\n",
    "# Plot scatter of average rates\n",
    "fig, ax = plt.subplots(figsize=(10,10))\n",
    "\n",
    "# Draw noise points\n",
    "point_idx = (labels_dbscan == biggest_cluster_label)\n",
    "ax.scatter(x[point_idx,0], x[point_idx,1], marker=\"o\", c=\"k\", s=10**2, edgecolor=\"k\", zorder=0)\n",
    "\n",
    "ax.tick_params(axis='both', which='major', labelsize=ticklabel_size) # Set tick label size\n",
    "ax.set_xlabel(f'Average rate of {genres_1}', fontsize=label_size)\n",
    "ax.set_ylabel(f'Average rate of {genres_2}', fontsize=label_size)\n",
    "ax.set_xlim([-0.1, 5.1])\n",
    "ax.set_ylim([-0.1, 5.1])\n",
    "ax.set_title(f'DBSCAN clustering {biggest_cluster_label}, eps = {eps}', fontsize=label_size)\n",
    "ax.tick_params(axis='both', which='major', labelsize=ticklabel_size) # Set tick label size\n",
    "# plt.savefig(f'rate_dbscan_cluster_{biggest_cluster_label}.png', dpi=300) # Make figure clearer\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用协同过滤对主要群体做个性化推荐"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "50e__pyHmuKT"
   },
   "source": [
    "根据聚类结果以最大类（主要群体）构建新数据集user_movie_rate_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get rate of users in cluster biggest_cluster_label\n",
    "user_ratings = ratings[ratings['userId'].isin(idx_both[labels_dbscan == biggest_cluster_label])]\n",
    "print('Number of users:', len(user_ratings['userId'].unique()))\n",
    "\n",
    "# Get list of movies which have been watched by users in user_ratings\n",
    "watched_movies = user_ratings['movieId'].unique()\n",
    "print('Number of watched movies:', len(watched_movies))\n",
    "\n",
    "# Filter out movies which are not genres_1 nor genres_2 from watched_movies\n",
    "filtered_movies = movies[movies['movieId'].isin(watched_movies) & \n",
    "                         (movies['genres'].str.contains(genres_1) | \n",
    "                          movies['genres'].str.contains(genres_2))]\n",
    "print('Number of filtered movies:', len(filtered_movies))\n",
    "\n",
    "# Only select 450 movies just to reduce computation time\n",
    "filtered_movies = filtered_movies.head(450)\n",
    "print('Number of selected movies:', len(filtered_movies))\n",
    "\n",
    "# Union rates and filtered_movies tables by 'movieId'\n",
    "user_movie_rate_table = pd.merge(user_ratings, filtered_movies, on='movieId')\n",
    "user_movie_rate_table.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由user_movie_rate_table生成用户相似度表user_movie_rate_array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get userID list\n",
    "user_list = user_movie_rate_table['userId'].unique()\n",
    "print(f'User number is {len(user_list)}')\n",
    "\n",
    "# Get movieID list\n",
    "movie_list = user_movie_rate_table['movieId'].unique()\n",
    "print(f'Movie number is {len(movie_list)}')\n",
    "\n",
    "# Generate user_movie_rate_array\n",
    "# Create a 2D array to store user-movie ratings\n",
    "user_movie_rate_array = np.zeros((len(user_list), len(movie_list)))\n",
    "\n",
    "# Fill the array with ratings\n",
    "for index, row in user_movie_rate_table.iterrows():\n",
    "    user_index = np.where(user_list == row['userId'])[0][0]\n",
    "    movie_index = np.where(movie_list == row['movieId'])[0][0]\n",
    "    \n",
    "    user_movie_rate_array[user_index, movie_index] = row['rating']\n",
    "\n",
    "print(f\"Shape of user_movie_rate_array: {user_movie_rate_array.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用热力图展示user_movie_rate_array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Set up the plot\n",
    "plt.figure(figsize=(10, 5))\n",
    "\n",
    "# Create the heatmap\n",
    "sns.heatmap(user_movie_rate_array, cmap='coolwarm', xticklabels=False, yticklabels=False)\n",
    "\n",
    "# Set title and labels\n",
    "plt.xlabel('Movies')\n",
    "plt.ylabel('Users')\n",
    "\n",
    "# Show the plot\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "记录所有未评分的电影"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a dictionary to store unrated movies for each user\n",
    "unrated_movies = {}\n",
    "unrated_movies_count = 0\n",
    "print(user_movie_rate_array.shape)\n",
    "# Iterate through each user\n",
    "for user_idx, user_ratings in enumerate(user_movie_rate_array):\n",
    "    # Find indices of unrated movies (where rating is 0)\n",
    "    unrated_indices = np.where(user_ratings < 0.1)[0]\n",
    "    \n",
    "    # Store the unrated movie IDs for this user\n",
    "    # Keys and values are the index of user_movie_rate_array\n",
    "    unrated_movies[user_idx] = unrated_indices\n",
    "    \n",
    "    if  isinstance(unrated_indices, int):        \n",
    "        unrated_movies_count += 1\n",
    "    else:\n",
    "        unrated_movies_count += unrated_indices.size\n",
    "\n",
    "print(f\"Number of users with unrated user-movies: {unrated_movies_count}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用余弦相似度，由user_movie_rate_array生成sim_matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# Compute cosine similarity matrix\n",
    "sim_matrix = cosine_similarity(user_movie_rate_array)\n",
    "print(f\"Size of sim_matrix is: {sim_matrix.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "使用sim_matrix预测电影评分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Function to predict ratings for unrated movies\n",
    "def predict_ratings(user_idx, unrated_indices, sim_matrix, user_movie_rate_array, k=10):\n",
    "    # Get the similarity scores for the current user\n",
    "    user_similarities = sim_matrix[user_idx]\n",
    "    \n",
    "    # Sort similarities in descending order and get top k similar users\n",
    "    similar_users = np.argsort(user_similarities)[::-1][1:k+1]\n",
    "    \n",
    "    predicted_ratings = []\n",
    "    \n",
    "    for movie_idx in unrated_indices:\n",
    "        # Get ratings of similar users for this movie\n",
    "        similar_user_ratings = user_movie_rate_array[similar_users, movie_idx]\n",
    "        \n",
    "        # Filter out users who haven't rated this movie\n",
    "        valid_ratings = similar_user_ratings[similar_user_ratings > 0]\n",
    "        valid_similarities = user_similarities[similar_users][similar_user_ratings > 0]\n",
    "        \n",
    "        if len(valid_ratings) > 0:\n",
    "            # Weighted average of ratings\n",
    "            predicted_rating = np.sum(valid_ratings * valid_similarities) / np.sum(valid_similarities)\n",
    "        else:\n",
    "            # If no similar user has rated this movie, use the global mean rating\n",
    "            predicted_rating = np.mean(user_movie_rate_array[user_movie_rate_array > 0])\n",
    "        \n",
    "        predicted_ratings.append(predicted_rating)\n",
    "    \n",
    "    return predicted_ratings\n",
    "\n",
    "# Predict ratings for all users\n",
    "predicted_ratings = {}\n",
    "for user_idx, unrated_indices in unrated_movies.items():\n",
    "    predicted_ratings[user_idx] = predict_ratings(user_idx, unrated_indices, sim_matrix, user_movie_rate_array)\n",
    "\n",
    "# Print some statistics\n",
    "total_predictions = sum(len(ratings) for ratings in predicted_ratings.values())\n",
    "print(f\"Total number of predictions made: {total_predictions}\")\n",
    "print(f\"Average predicted rating: {np.mean([rating for user_ratings in predicted_ratings.values() for rating in user_ratings]):.2f}\")\n",
    "\n",
    "# Update user_movie_rate_array with predicted ratings\n",
    "for user_idx, user_predictions in predicted_ratings.items():\n",
    "    user_movie_rate_array[user_idx, unrated_movies[user_idx]] = user_predictions\n",
    "\n",
    "print(\"Updated user_movie_rate_array with predicted ratings.\")\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up the plot\n",
    "plt.figure(figsize=(10, 5))\n",
    "\n",
    "# Create the heatmap\n",
    "sns.heatmap(user_movie_rate_array, cmap='coolwarm', xticklabels=False, yticklabels=False)\n",
    "\n",
    "# Set title and labels\n",
    "plt.xlabel('Movies')\n",
    "plt.ylabel('Users')\n",
    "\n",
    "# Show the plot\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyO+9EvQyLwzW6sRWnzqnjPu",
   "collapsed_sections": [
    "c-Xj4WtjPdwD"
   ],
   "include_colab_link": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
